1 Climate change and temperature anomalies

If we want to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies, which also publishes the temperature anomalies as tabular data.

To define temperature anomalies we need a reference, or base, period, which NASA states is 1951-1980.

We run the code below to load the file:

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")

Notice that, when using this function, we added two options: skip and na. The skip = 1 option is there because the real data table only starts in row 2, so we need to skip one row. The na = "***" option tells R how missing observations are coded in the spreadsheet. Looking at the spreadsheet, we can see that missing data is coded as "***". It is best to specify this here, as otherwise some of the data would not be recognized as numeric.

For each month and year, the data frame shows the deviation of temperature from the expected value over the base period. Note also that the data frame is in wide format.

We have two objectives in this section:

First is to select the year and the twelve month variables from the weather dataset. We do not need the others (J-D, D-N, DJF, etc.).

tidyweather <- weather %>% 
  select(c(1:13))

Second is to convert the data frame from wide to ‘long’ format.

tidyweather <- tidyweather %>% 
  pivot_longer(cols = 2:13, names_to = "Month", values_to = "delta")

1.1 Plotting Information

Let us plot the data using a time-series scatter plot, and add a trend line. To do that, we first need to create a new variable called date to ensure that the delta values are plotted chronologically.

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date, label=TRUE),
         year = year(date))
ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (subtitle = "Weather Anomalies",
        title = "Warmer Times",
        caption = "Source: NASA",
        x = "Year",
        y = "Temperature Deviation")

Using facet_wrap() to produce a separate scatter plot for each month, again with a smoothing line, we can examine whether the effect of increasing temperature is more pronounced in some months. Based on the faceted charts, it is: the effect appears more pronounced in the colder months of October through April than in the warmer months of May through September.
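A sketch of that faceted chart, reusing the tidyweather data frame and the month variable created above:

```r
# One scatter panel per month, each with its own smoothing line,
# to compare the warming trend across months
ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  facet_wrap(~month) +
  theme_bw() +
  labs(title = "Weather Anomalies by Month",
       caption = "Source: NASA",
       x = "Year",
       y = "Temperature Deviation")
```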

It is sometimes useful to group time into different periods to study historical data. For example, we often use decades such as 1970s, 1980s, 1990s, etc. to refer to a period of time. The code below creates a new data frame called comparison that groups data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data before 1881 using filter(). Then, we use the mutate() function to create a new variable, interval, which records the period each observation belongs to. We can assign the different periods using case_when().

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"))

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in.

ggplot(comparison, aes(x=delta, fill=interval))+
  geom_density(alpha=0.2) +
  theme_bw() +
  labs (title = "Density Plot for Monthly Temperature Anomalies",
        subtitle = "By Time Interval",
        caption = "Source: NASA",
        y = "Density",
        x = "Delta",
        fill = "Interval")

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.

average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%
  summarise(annual_average_delta = mean(delta, na.rm=TRUE)) 
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
  geom_point()+
  geom_smooth() +
  theme_bw() +
  labs (subtitle = "Average Yearly Anomaly",
        title = "Warmer Times",
        caption = "Source: NASA",
        y = "Average Annual Delta")

1.2 Confidence Interval for delta

NASA points out on their website:

"A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age."

Now we will construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.

formula_ci <- comparison %>% 
  drop_na() %>% 
  filter(interval == "2011-present") %>% 
  group_by(Year) %>% 
  summarise(mean_delta = mean(delta),
            SD_delta = sd(delta),
            count_delta = n(),
            t_critical = qt(0.975, count_delta-1),
            SE_delta = sd(delta) / sqrt(n()),
            margin_delta = t_critical * SE_delta,
            delta_low = mean_delta - margin_delta,
            delta_high = mean_delta + margin_delta)
formula_ci
## # A tibble: 9 x 9
##    Year mean_delta SD_delta count_delta t_critical SE_delta margin_delta
##   <dbl>      <dbl>    <dbl>       <int>      <dbl>    <dbl>        <dbl>
## 1  2011      0.7      0.103          12       2.20   0.0298       0.0655
## 2  2012      0.765    0.160          12       2.20   0.0462       0.102 
## 3  2013      0.753    0.111          12       2.20   0.0321       0.0706
## 4  2014      0.9      0.140          12       2.20   0.0405       0.0891
## 5  2015      1.13     0.163          12       2.20   0.0470       0.103 
## 6  2016      1.28     0.326          12       2.20   0.0940       0.207 
## 7  2017      1.13     0.213          12       2.20   0.0616       0.136 
## 8  2018      0.989    0.158          12       2.20   0.0455       0.100 
## 9  2019      1.12     0.163           7       2.45   0.0616       0.151 
## # … with 2 more variables: delta_low <dbl>, delta_high <dbl>
boot_delta <- comparison %>% 
  filter(interval == "2011-present") %>%
  specify(response = delta) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
percentile_ci <- boot_delta %>% 
  get_confidence_interval(level = 0.95, type = "percentile")
visualize(boot_delta) + 
  shade_ci(endpoints = percentile_ci,fill = "khaki")+
  labs(subtitle = "Bootstrap CI for Weather Delta",
       title = "Bootstrapping Delta",
       caption = "Source: NASA",
       y = "Count",
       x = "Average Annual Delta") +
  theme_bw()

The bootstrap distribution shows a simulation of the average temperature anomaly for the period 2011 to present. Because that period covers relatively few observations (roughly nine years of monthly deltas), a formula-based confidence interval is fairly broad. Bootstrapping resamples the data with replacement many times, which lets us approximate the sampling distribution of the mean directly. The beige area on the sampling distribution above represents the 95% confidence interval based on the bootstrapped samples. The result is that we can say with 95% confidence that the true average delta for the period 2011 to present lies between roughly 0.92 and 1.02.

2 General Social Survey (GSS)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends in attitudes, behaviors, and attributes. Many trends have been tracked for decades, so one can see the evolution of attitudes, etc. in American society.

Here, we analyze data from the 2016 GSS sample data, using it to estimate values of population parameters of interest about US adults. The GSS sample data file has 2,867 observations of 935 variables, but we are only interested in very few of these variables.

gss <- read_csv(here::here("data", "smallgss2016.csv"), 
                na = c("", "Don't know",
                       "No answer", "Not applicable"))

Many responses will not be taken into consideration, like “No Answer,” “Don’t Know,” “Not applicable,” and “Refused to Answer.”

We will be creating 95% confidence intervals for population parameters. The variables we have are the following:

  • hours and minutes spent on email weekly. The responses to these questions are recorded in the emailhr and emailmin variables. For example, if the response is 2.50 hours, this would be recorded as emailhr = 2 and emailmin = 30.
  • snapchat, instagrm, twitter: whether respondents used these social media in 2016
  • sex: Female - Male
  • degree: highest education level attained

2.1 Instagram and Snapchat, by sex

Can we estimate the population proportion of Snapchat or Instagram users in 2016?

First, we will create a new variable, snap_insta, that is Yes if the respondent reported using either Snapchat (snapchat) or Instagram (instagrm), and No if not. If the recorded value was NA for both questions, the value in the new variable will also be NA.

gss_v2 <- gss %>% 
  na_if("NA") %>%   # the raw file codes missing values as the literal string "NA"
  mutate(snap_insta = ifelse(is.na(snapchat) & is.na(instagrm), NA,
                      ifelse(snapchat == "Yes" | instagrm == "Yes", "Yes", "No")))
gss_v2
## # A tibble: 2,867 x 8
##    emailmin emailhr snapchat instagrm twitter sex    degree         snap_insta
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <chr>          <chr>     
##  1 0        12      <NA>     <NA>     <NA>    Male   Bachelor       <NA>      
##  2 30       0       No       No       No      Male   High school    No        
##  3 <NA>     <NA>    No       No       No      Male   Bachelor       No        
##  4 10       0       <NA>     <NA>     <NA>    Female High school    <NA>      
##  5 <NA>     <NA>    Yes      Yes      No      Female Graduate       Yes       
##  6 0        2       No       Yes      No      Female Junior college Yes       
##  7 0        40      <NA>     <NA>     <NA>    Male   High school    <NA>      
##  8 <NA>     <NA>    Yes      Yes      No      Female High school    Yes       
##  9 0        0       <NA>     <NA>     <NA>    Male   High school    <NA>      
## 10 <NA>     <NA>    No       No       No      Male   Junior college No        
## # … with 2,857 more rows

Second, we will calculate the proportion of Yes’s for snap_insta among those who answered the question, excluding NAs.

gss_v3 <- gss_v2 %>%
  filter(! is.na(snap_insta)) %>%
  mutate(user = ifelse(snap_insta == "Yes", 1, 0)) %>% 
  summarise(total = length(user), answer_yes = sum(user)) %>%
  mutate(prop = answer_yes / total) 
gss_v3
## # A tibble: 1 x 3
##   total answer_yes  prop
##   <int>      <dbl> <dbl>
## 1  1372        514 0.375

Third, using the CI formula for proportions, we will construct 95% CIs for the proportions of men and of women who used either Snapchat or Instagram.

gss_v4 <- gss_v2 %>% 
  filter(!is.na(snap_insta)) %>%
  mutate(user = ifelse(snap_insta == "Yes", 1, 0)) %>%
  group_by(sex) %>%
  summarise(total = length(user), answer_yes = sum(user)) %>% 
  mutate(prop = answer_yes / total)
gss_v5 <- gss_v4 %>% 
  group_by(sex) %>%
  summarise(lower_95 = prop - qt(0.975, total - 1) * sqrt((prop * (1 - prop)) / total), 
            upper_95 = prop + qt(0.975, total - 1) * sqrt((prop * (1 - prop)) / total))
gss_v5
## # A tibble: 2 x 3
##   sex    lower_95 upper_95
##   <chr>     <dbl>    <dbl>
## 1 Female    0.384    0.454
## 2 Male      0.281    0.356

2.2 Twitter, by education level

Now we will estimate the population proportion of Twitter users by education level in 2016.

There are 5 education levels in the variable degree which, in ascending order of years of education, are: Lt high school, High school, Junior college, Bachelor, Graduate.

First, we will turn degree from a character variable into a factor variable, making sure the order is correct and that levels are not sorted alphabetically.

gss_degree <- gss %>% 
  na_if("NA") %>% 
  # the levels must match the data's spelling exactly ("High school"),
  # otherwise those respondents are silently converted to NA and dropped
  mutate(degree = factor(degree, 
                         levels = c("Lt high school", "High school", 
                                    "Junior college", "Bachelor", "Graduate"))) %>%
  filter(!is.na(degree))
str(gss_degree$degree)
##  Factor w/ 5 levels "Lt high school",..: 4 4 5 3 3 1 1 5 1 1 ...

Next we will create a new variable, bachelor_graduate, that is Yes if the respondent has either a Bachelor or a Graduate degree. As before, if the recorded value for degree was NA, the value in the new variable will also be NA.

bachelor_graduate <- gss_degree %>% 
  mutate(bachelor_graduate = ifelse(is.na(degree), NA,
                      ifelse(degree == "Bachelor" | degree == "Graduate", "Yes", "No")))
bachelor_graduate
## # A tibble: 1,398 x 8
##    emailmin emailhr snapchat instagrm twitter sex    degree     bachelor_gradua…
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <fct>      <chr>           
##  1 0        12      <NA>     <NA>     <NA>    Male   Bachelor   Yes             
##  2 <NA>     <NA>    No       No       No      Male   Bachelor   Yes             
##  3 <NA>     <NA>    Yes      Yes      No      Female Graduate   Yes             
##  4 0        2       No       Yes      No      Female Junior co… No              
##  5 <NA>     <NA>    No       No       No      Male   Junior co… No              
##  6 30       0       No       No       No      Male   Lt high s… No              
##  7 0        1       <NA>     <NA>     <NA>    Female Lt high s… No              
##  8 0        1       Yes      No       Yes     Male   Graduate   Yes             
##  9 <NA>     <NA>    No       Yes      No      Female Lt high s… No              
## 10 0        1       Yes      Yes      No      Female Lt high s… No              
## # … with 1,388 more rows

Then we will calculate the proportion of bachelor_graduate who do (Yes) and who don’t (No) use Twitter.

proportion_bachelor_graduate <- bachelor_graduate %>% 
  filter(!is.na(twitter)) %>% 
  filter(bachelor_graduate == "Yes") %>% 
  group_by(twitter) %>% 
  summarise(count = n()) %>%
  mutate(prop = count / sum(count))
proportion_bachelor_graduate
## # A tibble: 2 x 3
##   twitter count  prop
##   <chr>   <int> <dbl>
## 1 No        375 0.767
## 2 Yes       114 0.233

Finally, using the CI formula for proportions, we construct two 95% CIs for bachelor_graduate respondents: one for those who use Twitter (Yes) and one for those who don’t (No).

CI_bachelor_graduate <- proportion_bachelor_graduate %>% 
  mutate(total = sum(count)) 
CI_bachelor_graduate %>% 
  group_by(twitter) %>%
  summarise(lower_95 = prop - qt(0.975, total - 1) * sqrt((prop * (1 - prop)) / total),
            upper_95 = prop + qt(0.975, total - 1) * sqrt((prop * (1 - prop)) / total))
## # A tibble: 2 x 3
##   twitter lower_95 upper_95
##   <chr>      <dbl>    <dbl>
## 1 No         0.729    0.804
## 2 Yes        0.196    0.271

The two confidence intervals do not overlap.

2.3 Email usage

Can we estimate the population parameter on time spent on email weekly?

First, we will create a new variable called email that combines emailhr and emailmin into the number of minutes the respondents spend on email weekly.

gss_email <- gss %>%
  na_if("NA") %>%
  mutate(email = as.numeric(emailhr) * 60 + as.numeric(emailmin))
gss_email
## # A tibble: 2,867 x 8
##    emailmin emailhr snapchat instagrm twitter sex    degree         email
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <chr>          <dbl>
##  1 0        12      <NA>     <NA>     <NA>    Male   Bachelor         720
##  2 30       0       No       No       No      Male   High school       30
##  3 <NA>     <NA>    No       No       No      Male   Bachelor          NA
##  4 10       0       <NA>     <NA>     <NA>    Female High school       10
##  5 <NA>     <NA>    Yes      Yes      No      Female Graduate          NA
##  6 0        2       No       Yes      No      Female Junior college   120
##  7 0        40      <NA>     <NA>     <NA>    Male   High school     2400
##  8 <NA>     <NA>    Yes      Yes      No      Female High school       NA
##  9 0        0       <NA>     <NA>     <NA>    Male   High school        0
## 10 <NA>     <NA>    No       No       No      Male   Junior college    NA
## # … with 2,857 more rows

Next, we will visualize the distribution of this new variable.

gss_mean_median <- gss_email %>%
  select(email) %>% 
  filter(! is.na(email)) %>%
  summarise(mean = mean(email), median = median(email))
gss_mean_median
## # A tibble: 1 x 2
##    mean median
##   <dbl>  <dbl>
## 1  417.    120
gss_email %>% 
  filter(!is.na(email)) %>%
  ggplot(aes(x = email)) + 
  geom_density() +
  labs(subtitle = "Distribution of minutes spent on email weekly",
       title = "Snail Mail Still Rules",
       caption = "Source: General Social Survey",
       x = "Minutes spent on email weekly",
       y = "") +
  theme_bw()

Because the email variable has a heavily right-skewed distribution, the most appropriate measure of the typical amount of time Americans spend on email is the median, as it is less affected by extreme values.

Finally, using the infer package, we will calculate a 95% bootstrap confidence interval for the mean amount of time Americans spend on email weekly.

gss_email_bootstrap <- gss_email %>%
  filter(! is.na(email)) %>%
  specify(response = email) %>% 
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
gss_email_CI <- gss_email_bootstrap %>%
  get_ci(level = 0.95, type = "percentile")
gss_email_CI
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     386.     449.

Our 95% bootstrap confidence interval shows that, with 95% confidence, the true mean amount of time Americans spend on email weekly falls between roughly 6 hours 27 minutes and 7 hours 30 minutes. Given the shape of the distribution plotted above, this seems to overstate how long a typical American spends on email; the presence of extreme values, combined with the right-skewed distribution, explains the overestimation.

We would expect a 99% confidence interval to be wider, because it must capture the true mean with higher probability and therefore span a greater share of the plausible values. In general, the higher the confidence level, the wider the interval.
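As a quick check, we can request a 99% interval from the same bootstrap distribution computed above; it should come out wider than the 95% interval:

```r
# 99% percentile interval from the existing bootstrap distribution
gss_email_bootstrap %>%
  get_ci(level = 0.99, type = "percentile")
```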

3 Trump’s Approval Margins

As we saw earlier, fivethirtyeight.com has detailed data on all polls that track the president’s approval rating.

approval_polllist <- read_csv(here::here('data', 'approval_polllist.csv'))
glimpse(approval_polllist)
## Rows: 15,619
## Columns: 22
## $ president           <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls…
## $ modeldate           <chr> "9/27/2020", "9/27/2020", "9/27/2020", "9/27/2020…
## $ startdate           <chr> "1/20/2017", "1/20/2017", "1/20/2017", "1/21/2017…
## $ enddate             <chr> "1/22/2017", "1/22/2017", "1/24/2017", "1/23/2017…
## $ pollster            <chr> "Gallup", "Morning Consult", "Ipsos", "Gallup", "…
## $ grade               <chr> "B", "B/C", "B-", "B", "B-", "C+", "B+", "B", "C+…
## $ samplesize          <dbl> 1500, 1992, 1632, 1500, 1651, 1500, 1190, 1500, 1…
## $ population          <chr> "a", "rv", "a", "a", "a", "lv", "rv", "a", "lv", …
## $ weight              <dbl> 0.262, 0.680, 0.153, 0.243, 0.142, 0.200, 1.514, …
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ approve             <dbl> 45.0, 46.0, 42.1, 45.0, 42.3, 57.0, 36.0, 46.0, 5…
## $ disapprove          <dbl> 45.0, 37.0, 45.2, 46.0, 45.8, 43.0, 44.0, 45.0, 4…
## $ adjusted_approve    <dbl> 45.7, 45.3, 43.2, 45.7, 43.4, 51.5, 37.6, 46.7, 5…
## $ adjusted_disapprove <dbl> 43.6, 38.3, 43.9, 44.6, 44.5, 44.5, 42.8, 43.6, 4…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tracking            <lgl> TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE,…
## $ url                 <chr> "http://www.gallup.com/poll/201617/gallup-daily-t…
## $ poll_id             <dbl> 49253, 49249, 49426, 49262, 49425, 49266, 49260, …
## $ question_id         <dbl> 77265, 77261, 77599, 77274, 77598, 77278, 77272, …
## $ createddate         <chr> "1/23/2017", "1/23/2017", "3/1/2017", "1/24/2017"…
## $ timestamp           <chr> "00:45:20 27 Sep 2020", "00:45:20 27 Sep 2020", "…
approval_polllist_fix <- approval_polllist %>% 
  mutate(modeldate = mdy(modeldate),
         startdate = mdy(startdate),
         enddate = mdy(enddate),
         createddate = mdy(createddate))

3.1 Create a plot

Using this data we will calculate the average net approval rate (approve - disapprove) for each week since Trump took office and plot it as a time series.
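One possible sketch of that calculation; dating each poll by its enddate and restricting to the "All polls" subgroup are assumptions on our part:

```r
# Average weekly net approval (approve - disapprove) since inauguration
weekly_approval <- approval_polllist_fix %>%
  filter(subgroup == "All polls") %>%
  mutate(net_approval = approve - disapprove,
         year = year(enddate),
         week = isoweek(enddate)) %>%
  group_by(year, week) %>%
  summarise(mean_approval = mean(net_approval))

ggplot(weekly_approval, aes(x = week, y = mean_approval)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "orange") +
  facet_wrap(~year) +
  theme_bw() +
  labs(title = "Estimating Net Approval (approve - disapprove) for Donald Trump",
       x = "Week of the year",
       y = "Average Net Approval (%)")
```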

3.2 Compare Confidence Intervals

Next we will compare the confidence intervals for week 15 (6-12 April 2020) and week 34 (17-23 August 2020).
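A sketch of that comparison, mirroring the earlier formula_ci approach (again assuming polls are dated by enddate and restricted to "All polls"); it yields a table like the output below:

```r
# 95% confidence intervals for weekly average net approval,
# weeks 15 and 34 of 2020
approval_polllist_fix %>%
  filter(subgroup == "All polls") %>%
  mutate(net_approval = approve - disapprove,
         year = year(enddate),
         week = isoweek(enddate)) %>%
  filter(year == 2020, week %in% c(15, 34)) %>%
  group_by(year, week) %>%
  summarise(mean_approval = mean(net_approval),
            SD_approval = sd(net_approval),
            count_approval = n(),
            t_critical = qt(0.975, count_approval - 1),
            SE_approval = SD_approval / sqrt(count_approval),
            margin_approval = t_critical * SE_approval,
            approval_low = mean_approval - margin_approval,
            approval_high = mean_approval + margin_approval)
```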

## # A tibble: 2 x 10
## # Groups:   year [1]
##    year  week mean_approval SD_approval count_approval t_critical SE_approval
##   <dbl> <dbl>         <dbl>       <dbl>          <int>      <dbl>       <dbl>
## 1  2020    15         -8.23        4.38             94       1.99       0.452
## 2  2020    34        -12.7         5.31             84       1.99       0.580
## # … with 3 more variables: margin_approval <dbl>, approval_low <dbl>,
## #   approval_high <dbl>

We can see that the confidence interval for week 15 is slightly narrower than that for week 34, because week 34’s higher standard deviation drives a higher standard error. This may reflect data released in mid-August, including a not-improving COVID-19 situation in the United States coupled with disappointing economic data. The wider confidence interval means a wider range of values over which we can be 95% confident the true population net approval lies.

4 Gapminder revisited

Recall the gapminder data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. Here, we will join a few data frames with more data than the gapminder package provides. Specifically, we will look at data on HIV prevalence, life expectancy, fertility rate, primary school enrollment, child mortality, and GDP per capita:

hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")
library(wbstats)
worldbank_data <- wb_data(country="countries_only",
                          indicator = indicators, 
                          start_date = 1960, 
                          end_date = 2016)
countries <-  wbstats::wb_cachelist$countries

We will join the 3 data frames (life_expectancy, worldbank_data, and hiv) into one. However, we first have to convert the data to tidy format, starting with the life_expectancy data. Next, we will tidy the HIV data and merge it with the life_expectancy data before finally merging the result with worldbank_data.

The first plot shows the relationship between HIV prevalence in a population and life expectancy, grouped by region of the world.

Mutation_life_expectancy <- life_expectancy %>%
  pivot_longer(cols = 2:300,
               names_to = "year",
               values_to = "life_expectancy") %>% 
  select(!2:3)
Mutation_HIV_prevalence <- hiv %>%
  pivot_longer(cols = 2:34,
               names_to = "year",
               values_to = "HIV_prevalence")
Combined_data1 <- left_join(Mutation_life_expectancy, Mutation_HIV_prevalence, 
                            by = c("country", "year"))
Combined_data2 <- left_join(Combined_data1, countries, by = "country")
Combined_data3 <- Combined_data2 %>% 
  select(!5:11) %>% 
  select(1:5)
Combined_hiv <- Combined_data3 %>% 
  filter(!is.na(HIV_prevalence),
         !is.na(region))
ggplot(Combined_hiv, aes(x = HIV_prevalence, y = life_expectancy)) +
  geom_point() +
  geom_smooth(colour = "red", method = "lm") +
  facet_wrap(~region) + 
  scale_x_log10() +
  theme_bw() +
    labs(subtitle = "Relationship Between HIV Prevalence and Life Expectancy",
         title = "HIV Around the World",
         x = "HIV Prevalence",
         y = "Life Expectancy")

In general, there appears to be only a modest negative relationship between HIV prevalence and life expectancy. Because this data set covers the entire history of HIV, it reflects the impressive strides humanity has made in improving outcomes for those living with the disease, even in the poorest and worst-affected regions of the world.

This next plot examines the relationship between fertility rate and GDP per capita.

Combined_data4 <- left_join(worldbank_data, countries, by = "country")
Combined_gdp <- Combined_data4 %>% 
  filter(!is.na(NY.GDP.PCAP.KD),
         !is.na(SP.DYN.TFRT.IN))
ggplot(Combined_gdp, aes(x = SP.DYN.TFRT.IN, y = NY.GDP.PCAP.KD)) +
  geom_point() +
  geom_smooth(colour="red", method = "lm") +
  scale_y_log10(labels = dollar) +
  facet_wrap(~region) +
  theme_bw() +  
    labs(subtitle = "Fertility Rate and GDP per Capita",
         title = "Who has Babies?",
         caption = "Source: World Bank",
         x = "Fertility Rate",
         y = "GDP per Capita")

There is a very consistent negative relationship between fertility rate and GDP per capita. This matches conventional wisdom and observations around the world: richer countries have fewer children per woman as women enter the workforce and raising children becomes more expensive.

Returning to HIV data, below is a bar chart of the number of missing observations for HIV data, grouped by region.

Combined_missing <- Combined_data3 %>%
  filter(year >= 1979, is.na(HIV_prevalence), !is.na(region)) %>% 
  group_by(region) %>%
  summarise(count = n())  # number of missing HIV observations per region
ggplot(Combined_missing, aes(x = count,y = reorder(region, count))) +
  geom_col() + 
  theme_bw() +
  labs(subtitle = "Missing Data Observations",
       title = "Data Deficiency",
       x = "Count",
       y = "")

Below are two sets of charts, one showing the countries that have made the largest improvement in child mortality by region and the other showing those that have made the smallest improvement.

Combined_mortality_top <- Combined_data4 %>%
  filter(!is.na(SH.DYN.MORT)) %>% 
  group_by(country, region) %>%
  mutate(improvement = lag(SH.DYN.MORT) - SH.DYN.MORT) %>% 
  filter(!is.na(improvement)) %>% 
  summarize(improvement = sum(improvement)) %>% 
  group_by(region) %>% 
  slice_max(order_by = improvement, n = 5) %>% 
  ungroup() %>% 
  mutate(region = as.factor(region),
         country = reorder_within(country, improvement, region))
ggplot(Combined_mortality_top, aes(x = country, y = improvement)) +
    geom_col() +
    facet_wrap(~region, scales = "free") +
    coord_flip() +
    scale_x_reordered() +
    scale_y_continuous() +
    theme_bw() +
    labs(title = "Save the Children", subtitle = "Largest Child Mortality Rate Improvement", x = "", y = "Rate Improvement", caption = "Source: World Bank")

Combined_mortality_bottom <- Combined_data4 %>%
  filter(!is.na(SH.DYN.MORT)) %>% 
  group_by(country, region) %>%
  mutate(improvement = lag(SH.DYN.MORT) - SH.DYN.MORT) %>% 
  filter(!is.na(improvement)) %>% 
  summarize(improvement = sum(improvement)) %>% 
  group_by(region) %>% 
  slice_min(order_by = improvement, n = 5) %>% 
  ungroup() %>% 
  mutate(region = as.factor(region),
         country = reorder_within(country, improvement, region))
ggplot(Combined_mortality_bottom, aes(x = country, y = improvement)) +
    geom_col() +
    facet_wrap(~region, scales = "free") +
    coord_flip() +
    scale_x_reordered() +
    scale_y_continuous() +
    theme_bw() +
    labs(title = "Save the Children", subtitle = "Smallest Child Mortality Rate Improvement", x = "", y = "Rate Improvement", caption = "Source: World Bank")

Sub-Saharan Africa, the Middle East & North Africa, and South Asia appear to have seen the largest wholesale improvements in child mortality rate, while Europe has seen the smallest improvement, likely because it started from much lower child mortality and so had less room to improve.

This final chart examines the relationship between fertility rate and primary school enrollment.

Combined_primary <- Combined_data4 %>% 
  filter(!is.na(SE.PRM.NENR), !is.na(SP.DYN.TFRT.IN))
ggplot(Combined_primary, aes(x = SE.PRM.NENR, y = SP.DYN.TFRT.IN)) +
  geom_point() +
  geom_smooth(colour = "red", method = "lm") +
  facet_wrap(~region) +
  theme_bw() +
  labs(subtitle = "Relationship between Fertility rate and Primary school enrollment", 
       title = "Educate the Children",
       caption = "Source: World Bank",
       x = "Primary School Enrollment", 
       y = "Fertility Rate")

Similar to the relationship between fertility rate and GDP per capita, there is a strong negative correlation between primary school enrollment and fertility rate. A possible explanation is that a country’s ability to provide widespread primary education is strongly tied to its wealth, so the link between GDP per capita and fertility carries over to this relationship as well.

5 Challenge 1: CDC COVID-19 Public Use Data

Next we will revisit the CDC Covid-19 Case Surveillance Data.

url <- "https://data.cdc.gov/api/views/vbim-akqf/rows.csv?accessType=DOWNLOAD"
covid_data <- vroom::vroom(url) %>%
  janitor::clean_names()

Given the data we have, we will produce two graphs that show the death rate (%): first by age group, sex, and whether the patient had co-morbidities or not.

covid_graph <- covid_data %>% 
  select(medcond_yn, death_yn, sex, age_group) %>%
  filter(!medcond_yn %in% c("Missing", "Unknown", "Other", NA)) %>% 
  filter(!sex %in% c("Missing", "Unknown", "Other", NA)) %>% 
  filter(!age_group %in% c("Missing", "Unknown", "Other", NA)) %>% 
  filter(!death_yn %in% c("Missing", "Unknown", "Other", NA)) %>% 
  mutate(death_yn = ifelse(death_yn == "Yes", 1, 0)) %>% 
  mutate(medcond_yn = ifelse(medcond_yn == "Yes", "With comorbidities", "Without comorbidities")) %>% 
  group_by(age_group, sex, medcond_yn) %>% 
  summarise(death = mean(death_yn))  # mean of 0/1 = proportion of deaths
ggplot(covid_graph, aes(x = (death * 100), y = age_group)) +
  geom_col(fill = "dark blue") +
  facet_grid(rows = vars(medcond_yn), cols = vars(sex)) +
  geom_text(aes(label = round(death * 100, 1)), position = position_dodge(width = 1), hjust = -0.1) +
  labs(title = "Covid death % by age group, sex and presence of co-morbidities",
       y = "", 
       x = "Death rate (%)",
       caption = "Source: CDC") +
  theme_bw()

And second, by age group, sex, and whether the patient was admitted to an Intensive Care Unit (ICU) or not.

covid_graph2 <- covid_data %>% 
  select(icu_yn, death_yn, sex, age_group) %>%
  filter(!icu_yn %in% c("Missing", "Unknown", "Other", NA)) %>% 
  filter(!sex %in% c("Missing", "Unknown", "Other", NA)) %>% 
  filter(!age_group %in% c("Missing", "Unknown", "Other", NA)) %>% 
  filter(!death_yn %in% c("Missing", "Unknown", "Other", NA)) %>% 
  mutate(death_yn = ifelse(death_yn == "Yes", 1, 0)) %>% 
  mutate(icu_yn = ifelse(icu_yn == "Yes", "Admitted to ICU", "Not admitted to ICU")) %>% 
  group_by(age_group, sex, icu_yn) %>% 
  summarise(death = mean(death_yn))  # mean of 0/1 = proportion of deaths
ggplot(covid_graph2, aes(x = (death * 100), y = age_group)) +
  geom_col(fill = "dark orange") +
  facet_grid(rows= vars(icu_yn), cols = vars(sex)) +
  geom_text(aes(label = round(death * 100, 1)), position = position_dodge(width = 1), hjust = -0.1) +
  labs(title = "Covid death % by age group, sex and ICU admission",
       y = "", 
       x = "Death rate (%)",
       caption = "Source: CDC") +
  theme_bw()

Our charts should look similar to these.

6 Challenge 2: Excess rentals in TfL bike sharing

Recall the TFL data on how many bikes were hired every single day. We can get the latest data by running the following.

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2020-09-18T09%3A06%3A54/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20201004%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201004T155121Z&X-Amz-Expires=300&X-Amz-Signature=473bb3ee1d5ea4d11dbfbfea79e594edbbe517035d268131dbc33ba131eb9493&X-Amz-SignedHeaders=host]
##   Date: 2020-10-04 15:51
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 165 kB
## <ON DISK>  /var/folders/5x/kd157hc16h91cxk4h0bt5bpm0000gn/T//RtmpiK9I1R/fileb7146ff6ade9.xlsx
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year. The distribution of bikes hired in May and June 2020 is drastically different from the distributions in previous years. This is clearly the result of the COVID-19 induced lockdowns that were imposed on London during those months.
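A minimal sketch of that facet grid, assuming we restrict to recent years (the 2015 cutoff is our choice):

```r
# Distribution of daily rentals, one panel per year/month combination
bike %>%
  filter(year >= 2015) %>%
  ggplot(aes(x = bikes_hired)) +
  geom_density() +
  facet_grid(rows = vars(year), cols = vars(month)) +
  theme_bw() +
  labs(title = "Distribution of daily TfL bike rentals by month and year",
       x = "Bikes hired per day",
       y = "")
```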

However, our main challenge is reproducing the two charts below. The first plots the deviation of actual bike rentals from expected bike rentals. The second looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to the second (weeks 14-26) and fourth (weeks 40-52) quarters. We used the median to calculate expected rentals because it minimizes the effect of extreme values, such as the day a tube strike crippled London’s transport network. Using the mean would let such a value drag the expected number of rentals upward even though it is clearly an anomaly from the usual trend.
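The expected-rentals step can be sketched as follows; the 2015-2019 base period and the monthly granularity are assumptions on our part:

```r
# Expected rentals per month = median over the base period
expected_monthly <- bike %>%
  filter(year >= 2015, year <= 2019) %>%
  group_by(month) %>%
  summarise(expected_hired = median(bikes_hired))

# Deviation of each month's actual average from that expectation
monthly_deviation <- bike %>%
  filter(year >= 2015) %>%
  group_by(year, month) %>%
  summarise(actual_hired = mean(bikes_hired)) %>%
  left_join(expected_monthly, by = "month") %>%
  mutate(excess_rentals = actual_hired - expected_hired)
```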

7 Details

  • Alexander Kirk, Celine Chi, Hans-Christian Preyer, Luca Toraldo, Pablo Carrera Lorenzo, and Yirui Xu all collaborated on this problem set.
  • We spent approximately 15 hours on this problem set.
  • The gapminder task gave us the most trouble, specifically the reorder_within function and trying to reorder from smallest to largest.

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.